Abstract

Despite its nonconvex nature, $\ell_0$ sparse approximation is
desirable in many theoretical and application cases. We study the
$\ell_0$ sparse approximation problem with the tool of deep learning,
by proposing Deep $\ell_0$ Encoders. Two typical forms, the
$\ell_0$-regularized problem and the M-sparse problem, are
investigated. Based on solid iterative algorithms, we model them as
feed-forward neural networks, through introducing novel neurons and
pooling functions. Enforcing such structural priors acts as an
effective network regularization. The deep encoders also enjoy faster
inference, larger learning capacity, and better scalability compared
to conventional sparse coding solutions. Furthermore, under
task-driven losses, the models can be conveniently optimized from end
to end. Numerical results demonstrate the impressive performance of
the proposed encoders.
Introduction

Sparse signal approximation has gained popularity over the last
decade. The sparse approximation model suggests that a natural signal
could be compactly approximated by only a few atoms out of a properly
given dictionary, where the weights associated with the dictionary
atoms are called the sparse codes. Proven to be both robust to noise
and scalable to high dimensional data, sparse codes are known as
powerful features, and benefit a wide range of signal processing
applications, such as source coding (Donoho et al. 1998), denoising
(Donoho 1995), source separation (Davies and Mitianoudis 2004),
pattern classification (Wright et al. 2009), and clustering (Cheng et
al. 2010).

We are particularly interested in the $\ell_0$-based sparse
approximation problem, which is the fundamental formulation of sparse
coding (Donoho and Elad 2003). The nonconvex $\ell_0$ problem is
intractable and is often attacked instead by minimizing surrogate
measures, such as the $\ell_1$-norm, which leads to more tractable
computational methods. However, it has been found both theoretically
and practically that solving the $\ell_0$ sparse approximation is
still preferable in many cases.

More recently, deep learning has attracted great attention in many
feature learning problems (Krizhevsky, Sutskever, and Hinton 2012).
The advantages of deep learning lie in its composition of multiple
non-linear transformations to yield more abstract and descriptive
embedding representations. With the aid of gradient descent, it also
scales linearly in time and space with the number of training samples.

It has been noticed that sparse approximation and deep learning bear
certain connections (Gregor and LeCun 2010). Their similar methodology
has lately been exploited in (Hershey, Roux, and Weninger 2014),
(Sprechmann et al. 2013), and (Sprechmann, Bronstein, and Sapiro
2015). By turning sparse coding models into deep networks, one may
expect faster inference, larger learning capacity, and better
scalability. The network formulation also facilitates the integration
of task-driven optimization.

In this paper, we investigate two typical forms of $\ell_0$-based
sparse approximation problems: the $\ell_0$-regularized problem, and
the M-sparse problem. Based on solid iterative algorithms (Blumensath
and Davies 2008), we formulate them as feed-forward neural networks
(Gregor and LeCun 2010), called Deep $\ell_0$ Encoders, through
introducing novel neurons and pooling functions. We study their
applications in image classification and clustering; in both cases the
models are optimized in a task-driven, end-to-end manner. Impressive
performance is observed in numerical experiments.

Related Work 

$\ell_0$- and $\ell_1$-based Sparse Approximations

Finding the sparsest, or minimum $\ell_0$-norm, representation of a
signal given a dictionary of basis atoms is an important problem in
many application domains. Consider a data sample
$x \in \mathbb{R}^{m \times 1}$ that is encoded into its sparse code
$a \in \mathbb{R}^{p \times 1}$ using a learned dictionary
$D = [d_1, d_2, \cdots, d_p]$, where $d_i \in \mathbb{R}^{m \times 1}$,
$i = 1, 2, \cdots, p$, are the learned atoms. The sparse codes are
obtained by solving the $\ell_0$-regularized problem ($\lambda$ is a
constant):

$$a = \arg\min_a \tfrac{1}{2}\|x - Da\|_F^2 + \lambda\|a\|_0. \tag{1}$$

Alternatively, one could explicitly impose constraints on the number
of non-zero coefficients of the solution, by solving the M-sparse
problem:

$$a = \arg\min_a \|x - Da\|_F^2 \quad \text{s.t.} \quad \|a\|_0 \le M. \tag{2}$$

Unfortunately, these optimization problems are often intractable
because there is a combinatorial increase in the number of local
minima as the number of candidate basis vectors increases. One
potential remedy is to employ a convex surrogate measure, such as the
$\ell_1$-norm, in place of the $\ell_0$-norm, which leads to a more
tractable optimization problem. For example, (1) could be relaxed as:

$$a = \arg\min_a \tfrac{1}{2}\|x - Da\|_F^2 + \lambda\|a\|_1. \tag{3}$$

This creates a unimodal optimization problem that can be solved via
linear programming techniques. The downside is that we have now
introduced a mismatch between the ultimate goal and the objective
function (Wipf and Rao 2004). Under certain conditions, the minimum
$\ell_1$-norm solution equals the minimum $\ell_0$-norm one (Donoho
and Elad 2003). But in practice, the $\ell_1$ approximation is often
used well beyond these conditions, and is thus quite heuristic. As a
result, we often get a solution that does not exactly minimize the
original $\ell_0$-norm.

That said, $\ell_1$ approximation is found to work well in practice
for many sparse coding problems. Yet in certain applications, such as
basis selection (Wipf and Rao 2004), we intend to control the exact
number of nonzero elements, and $\ell_0$ approximation is
indispensable. Beyond that, $\ell_0$ approximation is desirable for
performance reasons in many ways. In the compressive sensing
literature, empirical evidence (Candes, Wakin, and Boyd 2008)
suggested that using an iterative reweighted $\ell_1$ scheme to
approximate the $\ell_0$ solution often improved the quality of signal
recovery. In image enhancement, it was shown in (Yuan and Ghanem 2015)
that $\ell_0$ data fidelity was more suitable for reconstructing
images corrupted with impulse noise. For the purpose of image
smoothing, the authors of (Xu et al. 2011) utilized $\ell_0$ gradient
minimization to globally control how many non-zero gradients are used
to approximate prominent structures in a
structure-sparsity-management manner. Recent work (Wang, Wang, and
Singh 2015) revealed that $\ell_0$ sparse subspace clustering can
completely characterize the set of minimal union-of-subspace
structure, without the additional separation conditions required by
its $\ell_1$ counterpart.

Network Implementation of $\ell_1$-Approximation


In the feed-forward network of (Gregor and LeCun 2010), the target
sparse code a is obtained by solving (3) for a given dictionary D in
advance. The network has a finite number of stages, each of which
updates the intermediate sparse code $z^k$ (k = 1, 2) according to

$$z^{k+1} = s_\theta(Wx + Sz^k), \tag{4}$$

where $s_\theta$ is an element-wise shrinkage function (u is a vector
and $u_i$ is its i-th element, i = 1, 2, ..., p):

$$[s_\theta(u)]_i = \mathrm{sign}(u_i)(|u_i| - \theta_i)_+. \tag{5}$$

The parameterized encoder, named learned ISTA (LISTA), is a natural
network implementation of the iterative shrinkage and thresholding
algorithm (ISTA). LISTA learns all its parameters W, S and $\theta$
from training data using a back-propagation algorithm (LeCun et al.
2012). In this way, a good approximation of the underlying sparse code
can be obtained after a fixed small number of stages.
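
As an illustration of the stage update (4) with the shrinkage (5), the following is a minimal NumPy sketch of a LISTA-style forward pass. It is not the authors' implementation: the function names, the zero initial code, and the default of two stages are assumptions made for the example, and W, S and theta would in practice come from training.

```python
import numpy as np

def soft_threshold(u, theta):
    """Element-wise shrinkage of Eq. (5): sign(u_i) * max(|u_i| - theta_i, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def lista_forward(x, W, S, theta, num_stages=2):
    """Apply the stage update of Eq. (4) a fixed number of times.

    Assumption: the intermediate code starts from zero, so the first
    stage reduces to soft_threshold(W @ x, theta).
    """
    z = np.zeros(W.shape[0])
    for _ in range(num_stages):
        z = soft_threshold(W @ x + S @ z, theta)
    return z
```

Because the number of stages is fixed, the cost of inference is known in advance, which is the property the unfolded networks in this paper rely on.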

In (Sprechmann et al. 2013), the authors leveraged a similar idea on
fast trainable regressors and constructed feed-forward network
approximations of the learned sparse models. Such a process-centric
view was later extended in (Sprechmann, Bronstein, and Sapiro 2015) to
develop a principled process of learned deterministic fixed-complexity
pursuits, in lieu of iterative proximal gradient descent algorithms,
for structured sparse and robust low rank models. Very recently,
(Hershey, Roux, and Weninger 2014) further summarized the methodology
of problem-level and model-based "deep unfolding", and developed new
architectures as inference algorithms for both Markov random fields
and non-negative matrix factorization. Our work shares a similar
spirit with these prior works, yet studies the unexplored $\ell_0$
problems and obtains further insights.


Deep $\ell_0$ Encoders

Deep $\ell_0$-Regularized Encoder

To solve the optimization problem in (1), an iterative
hard-thresholding (IHT) algorithm was derived in (Blumensath and
Davies 2008):

$$a^{k+1} = h_{\lambda^{0.5}}(a^k + D^T(x - Da^k)), \tag{6}$$

where $a^k$ denotes the intermediate result of the k-th iteration, and
$h_{\lambda^{0.5}}$ is an element-wise hard thresholding operator:

$$[h_{\lambda^{0.5}}(u)]_i = \begin{cases} 0 & \text{if } |u_i| < \lambda^{0.5} \\ u_i & \text{if } |u_i| \ge \lambda^{0.5} \end{cases} \tag{7}$$
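
The iteration (6) with the operator (7) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the zero starting point, and the default iteration count are assumptions, and entries whose magnitude equals the threshold are kept here.

```python
import numpy as np

def hard_threshold(u, thresh):
    """Eq. (7): zero every entry whose magnitude is below thresh, keep the rest."""
    return np.where(np.abs(u) < thresh, 0.0, u)

def iht(x, D, lam, num_iters=10):
    """Iterate Eq. (6) with threshold lam ** 0.5.

    Assumption: the iteration starts from a = 0.
    """
    a = np.zeros(D.shape[1])
    for _ in range(num_iters):
        a = hard_threshold(a + D.T @ (x - D @ a), np.sqrt(lam))
    return a
```

Unlike soft shrinkage, the surviving coefficients are passed through unchanged, so large entries are not biased toward zero.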
Figure 1: A LISTA network (Gregor and LeCun 2010) with two
time-unfolded stages.

Figure 2: The block diagram of solving (6).


In (Gregor and LeCun 2010), a feed-forward neural network, as
illustrated in Fig. 1, was proposed to efficiently approximate the
$\ell_1$-based sparse code a of the input signal x.


Eqn. (6) could be alternatively rewritten as:

$$a^{k+1} = h_\theta(Wx + Sa^k), \quad W = D^T, \quad S = I - D^T D, \quad \theta = \lambda^{0.5}, \tag{8}$$
Figure 3: Deep $\ell_0$-Regularized Encoder, with two time-unfolded
stages.

Figure 4: Deep M-sparse Encoder, with two time-unfolded stages.


and expressed as the block diagram in Fig. 2, which outlines a
recurrent network form of solving (6).

By time-unfolding and truncating Fig. 2 to a fixed number of K
iterations (K = 2 in this paper by default)^1, we obtain the
feed-forward network structure in Fig. 3, named the Deep
$\ell_0$-Regularized Encoder, where W, S and $\theta$ are shared
between both stages. Furthermore, W, S and $\theta$ are all to be
learnt, instead of being directly constructed from any pre-computed D.
Although the equations in (8) no longer apply directly to the Deep
$\ell_0$-Regularized Encoder, they can usually serve as a high-quality
initialization of it.
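
To make the initialization concrete, the sketch below builds W, S and the thresholds from a dictionary following (8) and then runs K unfolded stages. It is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, the code starts from zero, and no training step is shown (in the encoder, W, S and theta would subsequently be updated by back-propagation).

```python
import numpy as np

def init_from_dictionary(D, lam):
    """Initial parameters following Eq. (8): W = D^T, S = I - D^T D, theta = lam^0.5."""
    p = D.shape[1]
    W = D.T
    S = np.eye(p) - D.T @ D
    theta = np.full(p, np.sqrt(lam))  # one threshold per code element
    return W, S, theta

def encoder_forward(x, W, S, theta, K=2):
    """K time-unfolded stages of a^{k+1} = h_theta(W x + S a^k), starting from a = 0."""
    a = np.zeros(W.shape[0])
    for _ in range(K):
        u = W @ x + S @ a
        a = np.where(np.abs(u) < theta, 0.0, u)  # element-wise hard threshold
    return a
```

With the parameters left at their initial values, K stages of this network reproduce K iterations of (6), since a + D^T(x - Da) = D^T x + (I - D^T D)a.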

Note that the activation thresholds $\theta$ are less straightforward
to update. We rewrite (7) as: $[h_\theta(u)]_i = \theta_i h_1(u_i/\theta_i)$.
It indicates that the original neuron with trainable thresholds can be
decomposed into two linear scaling layers, plus a unit-hard-threshold
neuron, which we call the Hard thrEsholding Linear Unit (HELU). The
weights of the two scaling layers are diagonal matrices defined by
$\theta$ and its element-wise reciprocal, respectively.
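
The decomposition can be checked numerically. The sketch below is an illustration only, with hypothetical function names and the assumption that every threshold is strictly positive; it scales by the reciprocal of theta, applies the unit hard threshold, and scales back by theta.

```python
import numpy as np

def helu(u):
    """Unit hard threshold: keep entries with |u_i| >= 1, zero the rest."""
    return np.where(np.abs(u) < 1.0, 0.0, u)

def thresholded_neuron(u, theta):
    """Scale by 1/theta, apply HELU, scale by theta (theta_i > 0 assumed)."""
    return theta * helu(u / theta)
```

Writing the neuron this way moves the trainable thresholds into two ordinary diagonal linear layers, so the non-linearity itself has no parameters.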

Discussion on HELU. While inspired by LISTA, the Deep
$\ell_0$-Regularized Encoder differs from it in the HELU neuron.
Compared to classical neuron functions such as logistic, sigmoid, and
ReLU (Mhaskar and Micchelli 1994), as well as the soft shrinkage and
thresholding operation (5) in LISTA, HELU does not penalize large
values, yet enforces a strong (in theory infinite) penalty over small
values. As such, HELU tends to produce highly sparse solutions.

The neuron form of LISTA (5) could be viewed as a double-sided and
translated variant of ReLU, which is continuous and piecewise linear.
In contrast, HELU is a discontinuous function that rarely occurs in
existing deep network neurons. As pointed out by (Hornik,
Stinchcombe, and White 1989), HELU has countably many discontinuities
and is thus (Borel) measurable, in which case the universal
approximation capability of the network is not compromised. However,
experiments remind us that the algorithmic learnability with such
discontinuous neurons (using popular first-order methods) is in
question, and the training is in general hard. For computational
reasons, we replace HELU with the following continuous and piecewise
linear function $\mathrm{HELU}_\sigma$ during network training:


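$$[\mathrm{HELU}_\sigma(u)]_i = \begin{cases} 0 & \text{if } |u_i| \le 1 - \sigma \\ \frac{1}{\sigma}(u_i - 1 + \sigma) & \text{if } 1 - \sigma < u_i < 1 \\ \frac{1}{\sigma}(u_i + 1 - \sigma) & \text{if } -1 < u_i < \sigma - 1 \\ u_i & \text{if } |u_i| \ge 1 \end{cases} \tag{9}$$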

Obviously, $\mathrm{HELU}_\sigma$ becomes HELU when $\sigma \to 0$. To
approximate HELU, we tend to choose a very small $\sigma$, while
avoiding making the training ill-posed. As a practical strategy, we
start with a moderate $\sigma$ (0.2 by default), and divide it by 10
after each epoch. After several epochs, $\mathrm{HELU}_\sigma$ becomes
very close to the ideal HELU.
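
A minimal NumPy sketch of the relaxed neuron in (9) and of the annealing strategy described above follows. The function names and the `factor` parameter are assumptions introduced for the example; only the starting value 0.2 and the division by 10 per epoch come from the text.

```python
import numpy as np

def helu_sigma(u, sigma):
    """Continuous, piecewise linear relaxation of HELU, Eq. (9), for sigma > 0."""
    mag = np.abs(u)
    # Linear ramp joining 0 at |u| = 1 - sigma to +/-1 at |u| = 1.
    ramp = np.sign(u) * (mag - 1.0 + sigma) / sigma
    return np.where(mag >= 1.0, u, np.where(mag <= 1.0 - sigma, 0.0, ramp))

def sigma_schedule(num_epochs, sigma0=0.2, factor=10.0):
    """Value of sigma used in each epoch: start at sigma0, divide by factor per epoch."""
    return [sigma0 / factor ** epoch for epoch in range(num_epochs)]
```

The ramp has slope 1/sigma, so gradients through the transition region grow as sigma shrinks; this is one way to read the remark about avoiding an ill-posed training problem.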

In (Rozell et al. 2008), the authors introduced an ideal hard
thresholding function for solving sparse coding, whose formulation was
close to HELU. Note that (Rozell et al. 2008) approximates the ideal
function with a sigmoid function, which has connections with our
$\mathrm{HELU}_\sigma$ approximation. In (Konda, Memisevic, and
Krueger 2014), a similar truncated linear ReLU was utilized in the
networks.


Deep M-Sparse Encoder 

Neither the $\ell_0$-regularized problem in (1) nor the Deep
$\ell_0$-Regularized Encoder has explicit control over the sparsity
level of the solution. One would therefore turn to the M-sparse
problem in (2), and derive the following iterative algorithm
(Blumensath and Davies 2008):

$$a^{k+1} = h_M(a^k + D^T(x - Da^k)). \tag{10}$$


^1 We tested larger K values (3 or 4). In several cases they do bring
performance improvements, but add complexity too.


Eqn. (10) resembles (6), except that $h_M$ is now a non-linear
operator retaining the M coefficients with the largest absolute
values. Following the same methodology as in the previous section, the
iterative form could be time-unfolded and truncated to the Deep
M-sparse Encoder, as in Fig. 4. To deal with the $h_M$ operation, we
refer to the popular concepts of pooling and unpooling (Zeiler,
Taylor, and Fergus 2011) in deep networks, and introduce the pair of
maxM pooling and unpooling in Fig. 4.

Discussion on maxM pooling/unpooling. Pooling is popular in
convolutional networks to obtain translation-invariant features
(Krizhevsky, Sutskever, and Hinton 2012). It is less common in other
forms of deep networks (Gulcehre et al. 2014). The unpooling operation
was introduced in (Zeiler, Taylor, and Fergus 2011) to insert the
pooled values back to the appropriate locations of feature maps for
reconstruction purposes.

In our proposed Deep M-sparse Encoder, the pooling and unpooling
operation pair is used to construct a projection from $\mathbb{R}^p$
to its subset $S : \{s \in \mathbb{R}^p \mid \|s\|_0 \le M\}$. The
maxM pooling and unpooling functions are intuitively defined as:

$$[p_M, \mathrm{idx}_M] = \mathrm{maxM.pooling}(u), \qquad u_M = \mathrm{maxM.unpooling}(p_M, \mathrm{idx}_M). \tag{11}$$

For each input u, the pooled map $p_M$ records the M values that are
largest in magnitude (irrespective of sign), and the switch
$\mathrm{idx}_M$ records their locations. The corresponding unpooling
operation takes the elements in $p_M$ and places them in $u_M$ at the
locations specified by $\mathrm{idx}_M$, the remaining elements being
set to zero. The resulting $u_M$ is of the same dimension as u but has
no more than M non-zero elements. In back propagation, each position
in $\mathrm{idx}_M$ is propagated with the entire error signal.
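
The pooling and unpooling pair can be sketched in NumPy as below. This is an illustrative forward pass only: the function names are hypothetical, the explicit `size` argument is an assumption added so that unpooling knows the output dimension, and ties in magnitude are broken by position.

```python
import numpy as np

def max_m_pooling(u, M):
    """Return the M entries of u largest in magnitude and their locations."""
    idx_M = np.argsort(-np.abs(u), kind="stable")[:M]
    return u[idx_M], idx_M

def max_m_unpooling(p_M, idx_M, size):
    """Place the pooled values back at their locations; all other entries are zero."""
    u_M = np.zeros(size, dtype=p_M.dtype)
    u_M[idx_M] = p_M
    return u_M
```

For example, pooling `u = [0.1, -3.0, 2.0, 0.5]` with `M = 2` selects the entries at positions 1 and 2, and unpooling yields `[0.0, -3.0, 2.0, 0.0]`. Composing the two functions gives the projection onto vectors with at most M non-zero entries.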

Theoretical Properties 

It is shown in (Blumensath and Davies 2008) that the iterative
algorithms in both (6) and (10) are guaranteed not to increase the
cost functions. Under mild conditions, their targeted fixed points are
local minima of the original problems. As the next step after the time
truncation, the deep encoder models are to be solved by the stochastic
gradient descent (SGD) algorithm, which converges to stationary points
under a few stricter assumptions than the ones satisfied in this paper
(Bottou 2010)^2. However, the entanglement of the iterative algorithms
and the SGD algorithm makes the overall convergence analysis very
difficult.

One must emphasize that in each step, the back propagation procedure
requires only operations of order O(p) (Gregor and LeCun 2010). The
training algorithm takes O(Cnp) time (C is the constant absorbing
epochs, stage numbers, etc.). The testing process is purely
feed-forward and is therefore dramatically faster than traditional
inference methods that solve (1) or (2). SGD is also easy to
parallelize.

Task-Driven Optimization 

It is often desirable to jointly optimize the learned sparse code
features and the targeted task so that they mutually reinforce each
other. The authors of (Jiang, Lin, and Davis 2011) associated label
information with each dictionary item by adding discriminative
regularization terms to the objective. Recent work (Mairal, Bach, and
Ponce 2012), (Wang et al. 2015a) developed task-driven sparse coding
via bi-level optimization models, where ($\ell_1$-based) sparse coding
is formulated as the lower-level constraint while a task-oriented cost
function is minimized as its upper-level objective. The above
approaches in sparse coding are complicated and computationally
expensive. It is much more convenient to implement end-to-end
task-driven training in deep architectures, by concatenating the
proposed deep encoders with certain task-driven loss functions.

^2 As a typical case, we use SGD in a setting where it is not
guaranteed to converge in theory, but behaves well in practice.

In this paper, we mainly discuss two tasks: classification and
clustering, while being aware of other immediate extensions, such as
semi-supervised learning. We assume K classes (or clusters), and
$\omega = [\omega_1, ..., \omega_K]$ as the set of parameters of the
loss function, where $\omega_j$ corresponds to the j-th class
(cluster), j = 1, 2, ..., K. For the classification case, one natural
choice is the well-known softmax loss function. For the clustering
case, since the true cluster label of each x is unknown, we define the
predicted confidence probability $p_j$ that sample x belongs to
cluster j, as the likelihood of softmax regression:

$$p_j = p(j \mid \omega, a) = \frac{e^{-\omega_j^T a}}{\sum_{l=1}^{K} e^{-\omega_l^T a}}. \tag{12}$$

The predicted cluster label of a is the cluster j where it achieves
the largest $p_j$.
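
The confidence probabilities and the label assignment can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names are hypothetical, omega is assumed to be stored as a K x p array with one row per class or cluster, and the negative sign in the exponent is an assumption about the convention (flipping it is equivalent to negating omega).

```python
import numpy as np

def cluster_confidence(a, omega):
    """Softmax confidences p_j for a code a, given omega of shape (K, p)."""
    scores = -(omega @ a)      # assumed sign convention, see lead-in
    scores = scores - scores.max()  # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def predicted_label(a, omega):
    """Index j (0-based) of the class or cluster with the largest p_j."""
    return int(np.argmax(cluster_confidence(a, omega)))
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged while avoiding overflow for large scores.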

Experiment 

Implementation 

The two proposed deep $\ell_0$ encoders are implemented with the CUDA
ConvNet package (Krizhevsky, Sutskever, and Hinton 2012). We use a
constant learning rate of 0.01 with no momentum, and a batch size of
128. In practice, given that the model is well initialized, the
training takes approximately 1 hour on the MNIST dataset, on a
workstation with 12 Intel Xeon 2.67GHz CPUs and 1 GTX680 GPU. It is
also observed that the training time of our model scales approximately
linearly with the size of the data.

While many neural networks train well with random initializations
without pre-training, given that the training data is sufficient, it
has been discovered that poorly initialized networks can hamper the
effectiveness of first-order methods (e.g., SGD) (Sutskever et al.
2013). For the proposed models, however, it is much easier to
initialize the model in the right regime, thanks to the analytical
relationships between sparse coding and network parameters in (8).

Simulation on Sparse Approximation 

We first compare the performance of different methods on $\ell_0$
sparse code approximation. The first 60,000 samples of the MNIST
dataset are used for training and the last 10,000 for testing. Each
patch is resized to 16 × 16 and then preprocessed to remove its mean
and normalize its variance. The patches with small standard deviations
are discarded. A sparsity coefficient $\lambda = 0.5$ is used in (1),
and the sparsity level M = 32 is fixed in (2). The sparse code
dimension (dictionary size) p is varied.
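
The patch preprocessing described above can be sketched as below. This is a hedged illustration: the function name and the cut-off `min_std` are hypothetical (the text does not state the threshold used to discard low-variance patches), and each patch is assumed to be stored as one flattened row.

```python
import numpy as np

def preprocess_patches(patches, min_std=1e-2):
    """Drop low-variance patches, then remove the mean and normalize the variance.

    patches: array of shape (n, d), one flattened patch per row.
    min_std: hypothetical cut-off below which a patch is discarded.
    """
    std = patches.std(axis=1)
    kept = patches[std >= min_std]
    kept = kept - kept.mean(axis=1, keepdims=True)
    return kept / kept.std(axis=1, keepdims=True)
```

Discarding near-constant patches before normalizing avoids dividing by a standard deviation close to zero.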




Table 1: Prediction error (%) comparison of all methods on solving the
$\ell_0$-regularized problem (1)

Method                              p = 128   p = 256   p = 512
Iterative (2 iterations)              17.52     18.73     22.40
Iterative (5 iterations)               8.14      6.75      9.37
Iterative (10 iterations)              3.55      4.33      4.08
Baseline Encoder                       8.94      8.76     10.17
Deep $\ell_0$-Regularized Encoder      0.92      0.91      0.81

Our prediction task resembles the setup in (Gregor and LeCun 2010):
first learning a dictionary from training data, followed by solving
sparse approximation (3) with respect to the dictionary, and finally
training the network as a regressor from input samples to the solved
sparse codes. The only major difference here is that, unlike the
$\ell_1$-based problems, the non-convex $\ell_0$-based minimization
can only reach a (non-unique) local minimum. To improve stability, we
first solve the $\ell_1$ problems to obtain a good initialization for
the $\ell_0$ problems, and then run the iterative algorithms (6) or
(10) until convergence. The obtained sparse codes are called "optimal
codes" hereinafter and used in both training and testing evaluation
(as "groundtruth"). One must keep in mind that we are not seeking to
produce approximate sparse codes for all possible input vectors, but
only for input vectors drawn from the same distribution as our
training samples.


Table 2: Prediction error (%) comparison of all methods on solving the
M-sparse problem (2)

Method                        p = 128   p = 256   p = 512
Iterative (2 iterations)        17.23     19.27     19.31
Iterative (5 iterations)        10.84     12.52     12.40
Iterative (10 iterations)        5.67      5.44      5.20
Baseline Encoder                14.04     16.76     12.86
Deep M-Sparse Encoder            2.94      2.87      3.29


Table 3: Averaged non-zero support error comparison of all methods on
solving the M-sparse problem (2)

Method                        p = 128   p = 256   p = 512
Iterative (2 iterations)         10.8      13.4      13.2
Iterative (5 iterations)          6.1       8.0       8.8
Iterative (10 iterations)         4.6       5.6       5.3
Deep M-Sparse Encoder             2.2       2.7       2.7


We compare the proposed deep encoders with the iterative algorithms
under different numbers of iterations. In addition, we include a
baseline encoder in the comparison, which is a fully-connected
feed-forward network consisting of three hidden layers of dimension p
with ReLU neurons. The baseline encoder thus has the same parameter
capacity as the deep $\ell_0$ encoders^3. We apply dropout to the
baseline encoders, with the probabilities of retaining the units being
0.9, 0.9, and 0.5. The proposed encoders do not apply dropout.

The deep $\ell_0$ encoders and the baseline encoder are first trained,
and all are then evaluated on the testing set. We calculate the total
prediction errors, i.e., the normalized squared errors between the
optimal codes and the predicted codes, as in Tables 1 and 2. For the
M-sparse case, we also compare their recovery of non-zero supports in
Table 3, by counting the mismatched nonzero element locations between
optimal and predicted codes (averaged on all samples). Immediate
conclusions from the numerical results are as follows:

• The proposed deep encoders generalize very well, thanks to the
effective regularization brought by their architectures, which are
derived from specific problem formulations (i.e., (1) and (2)) as
priors. The "general-architecture" baseline encoders, which have
the same parameter complexity, appear to overfit the training set
and generalize much worse.

• While the deep encoders only unfold two stages, they outperform
their iterative counterparts even when the latter have run 10
iterations. Meanwhile, the former enjoy much faster inference
because they are feed-forward.

• The Deep $\ell_0$-Regularized Encoder obtains a particularly low
prediction error. A plausible interpretation is that, while the
iterative algorithm has to work with a fixed $\lambda$, the Deep
$\ell_0$-Regularized Encoder is capable of "fine-tuning" this
hyper-parameter automatically (after diag($\theta$) is initialized
from $\lambda$), by exploring the training data structure.

• The Deep M-Sparse Encoder is able to find the nonzero 
support with high accuracy. 
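
The two evaluation metrics behind Tables 1 to 3 can be sketched as follows. The exact normalization and the way mismatches are counted are not spelled out in the text, so both are assumptions here: the prediction error is taken as the total squared error divided by the total squared norm of the optimal codes, and a mismatch is any position that is non-zero in exactly one of the two codes. The function names are hypothetical.

```python
import numpy as np

def prediction_error(optimal, predicted):
    """Normalized squared error in percent (normalization is an assumption).

    optimal, predicted: arrays of shape (n, p), one code per row.
    """
    return 100.0 * np.sum((predicted - optimal) ** 2) / np.sum(optimal ** 2)

def support_error(optimal, predicted):
    """Average number of positions that are non-zero in exactly one of the codes."""
    mismatch = (optimal != 0) ^ (predicted != 0)
    return mismatch.sum(axis=1).mean()
```

Under this reading, a support error of 2.7 with M = 32 would mean that fewer than three of the selected or missed locations differ per sample on average.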

Applications on Classification 

Since the task-driven models are trained from end to end, no
pre-computation of a is needed. For classification, we evaluate our
methods on the MNIST dataset, and the AVIRIS Indiana Pines
hyperspectral image dataset (see (Wang, Nasrabadi, and Huang 2015) for
details). We compare our two proposed deep encoders with two
competitive sparse coding-based methods: 1) task-driven sparse coding
(TDSC) in (Mairal, Bach, and Ponce 2012), with the original setting
followed and all parameters carefully tuned; 2) a pre-trained LISTA
followed by supervised tuning with softmax loss. Note that for the
Deep M-Sparse Encoder, M is not known in advance and has to be
tuned^4. To our surprise, the tuning of M can improve the performance
significantly, which is analyzed next. The overall error rates are
compared in Tables 4 and 5.

In general, the proposed deep $\ell_0$ encoders provide superior
results to the deep $\ell_1$-based method (tuned LISTA). TDSC also
generates competitive results, but at the cost of high inference
complexity, i.e., solving conventional sparse coding. It is of
particular interest to us that when supplied with specific M values,
the Deep M-Sparse Encoder can generate remarkably improved results.
Especially in Table 5, when M = 10, the error rate is around 1.5%
lower than that of M = 30. Note that in the AVIRIS Indiana Pines
dataset, the training data volume is much smaller than that of MNIST.
We therefore conjecture that it might not be sufficiently effective to
let the training process depend fully on data; instead, crafting a
stronger sparsity prior with a smaller M could help learn more
discriminative features^5. Such behavior provides us with an important
hint to impose suitable structural priors on deep networks.

^3 Except for the "diag($\theta$)" layers in Fig. 3, each of which
contains only p free parameters.

Table 4: Classification error rate (%) comparison of all methods on
the MNIST dataset

Method                        p = 128   p = 256   p = 512
TDSC                             0.71      0.55      0.53
Tuned LISTA                      0.74      0.62      0.57
Deep $\ell_0$-Regularized        0.72      0.58      0.52
Deep M-Sparse (M = 10)           0.72      0.57      0.53
Deep M-Sparse (M = 20)           0.69      0.54      0.51
Deep M-Sparse (M = 30)           0.73      0.57      0.52

Table 5: Classification error rate (%) comparison of all methods on
the AVIRIS Indiana Pines dataset

Method                        p = 128   p = 256   p = 512
TDSC                            15.55     15.27     15.21
Tuned LISTA                     16.12     16.05     15.97
Deep $\ell_0$-Regularized       15.20     15.07     15.01
Deep M-Sparse (M = 10)          13.77     13.56     13.52
Deep M-Sparse (M = 20)          14.67     14.23     14.07
Deep M-Sparse (M = 30)          15.14     15.02     15.00

Applications on Clustering 

For clustering, we evaluate our methods on the COIL 20 and the CMU PIE
datasets (Sim, Baker, and Bsat 2002). The two state-of-the-art methods
we compare against are the jointly optimized sparse coding and
clustering method proposed in (Wang et al. 2015a), and the
graph-regularized deep clustering method in (Wang et al. 2015b)^6. The
overall error rates are compared in Tables 6 and 7.

Note that the method in (Wang et al. 2015b) incorporated Laplacian
regularization as an additional prior while the others did not. It is
thus no wonder that this method often performs better than the others.
Even without utilizing any graph information, the proposed deep
encoders are able to obtain very close performance, and outperform
(Wang et al. 2015b) in certain cases. On the COIL 20 dataset, the
lowest error rate is reached by the Deep M-Sparse (M = 10) Encoder,
when p = 512, followed by the Deep $\ell_0$-Regularized Encoder.

^4 To get a good estimate of M, one might first try to perform
(unsupervised) sparse coding on a subset of samples.

^5 Interestingly, there are a total of 16 classes in the AVIRIS
Indiana Pines dataset. When p = 128, each class has on average 8
"dictionary atoms" for class-specific representation. Therefore M = 10
approximately coincides with the sparse representation classification
(SRC) principle (Wang, Nasrabadi, and Huang 2015) of forcing sparse
codes to be compactly focused on one class of atoms.

^6 Both papers train their models under both soft-max and max-margin
type losses. To ensure a fair comparison, we adopt the former, with
the same form of loss function as ours.


Table 6: Clustering error rate (%) comparison of all methods on the
COIL 20 dataset

Method                        p = 128   p = 256   p = 512
(Wang et al. 2015a)             17.75     17.14     17.15
(Wang et al. 2015b)             14.47     14.17     14.08
Deep $\ell_0$-Regularized       14.52     14.27     14.06
Deep M-Sparse (M = 10)          14.59     14.25     14.03
Deep M-Sparse (M = 20)          14.84     14.33     14.15
Deep M-Sparse (M = 30)          14.77     14.37     14.12


Table 7: Clustering error rate (%) comparison of all methods on the
CMU PIE dataset

Method                        p = 128   p = 256   p = 512
(Wang et al. 2015a)             17.50     17.26     17.20
(Wang et al. 2015b)             16.14     15.58     15.09
Deep $\ell_0$-Regularized       16.08     15.72     15.41
Deep M-Sparse (M = 10)          16.77     16.46     16.02
Deep M-Sparse (M = 20)          16.44     16.23     16.05
Deep M-Sparse (M = 30)          16.46     16.17     16.01


On the CMU PIE dataset, the Deep $\ell_0$-Regularized Encoder achieves
accuracy competitive with (Wang et al. 2015b), and outperforms all
Deep M-Sparse Encoders by noticeable margins, which differs from the
other cases. Previous work found that sparse approximations over CMU
PIE had significant errors (Yang, Yu, and Huang 2010), which we also
verified. Therefore, hardcoding exact sparsity could even hamper the
model performance.

Remark: From those experiments, we gain additional insights into
designing deep architectures:

• If one expects the model to explore the data structure by
itself, and provided that there is sufficient training data,
then the Deep $\ell_0$-Regularized Encoder (and its peers)
might be preferred, as all its parameters, including the
desired sparsity, are fully learnable from the data.

• If one has certain correct prior knowledge of the data 
structure, including but not limited to the exact sparsity 
level, one should choose Deep M-Sparse Encoder, or 
other models of its type that are designed to maximally 
enforce that prior. The methodology could be especially 
useful when the training data is less than sufficient. 

We hope the above insights could be of reference to many 
other deep learning models. 







Conclusion 

We propose Deep $\ell_0$ Encoders to solve the $\ell_0$ sparse
approximation problem. Rooted in solid iterative algorithms, the Deep
$\ell_0$-Regularized Encoder and the Deep M-Sparse Encoder are
developed, each designed to solve one typical formulation, accompanied
by the introduction of the novel HELU neuron and maxM
pooling/unpooling. When applied to the specific tasks of
classification and clustering, the models are optimized in an
end-to-end manner. The latest deep learning tools enable us to solve
them in a highly effective and efficient fashion. They not only
provide impressive performance in numerical experiments, but also
offer important insights into designing deep models.